A banking services company with customers mainly in Europe.
It offers financial products such as bank accounts, investments, and insurance.
Business model: banking services through physical and online branches.
Main product: a free bank account, valid for 12 months; after that period the account must be renewed.
Bank account revenue per customer:
Model performance and results report with the following topics:
What is the company's current churn rate?
How does the churn rate vary by month?
How well does the model perform at labeling customers as churners?
What is the company's revenue if it prevents customers from cancelling their contracts by using the developed model?
Action Plan:
library(dplyr)
library(data.table)
library(ggplot2)
library(plotly)
library(vcd)
library(grid)
library(tidymodels)
library(treesnip)
library(lightgbm)
ggplot2::theme_set(ggplot2::theme_minimal())
feature_engineering <- function(df){
df <- df %>% dplyr::mutate(tenure_year = tenure + 1,
age_ten_year = age / tenure_year,
cred_ten_year = credit_score / tenure_year,
cred_age = credit_score / age,
amount = estimated_salary + balance,
amount_credit = amount / credit_score,
amount_ten_year = amount / tenure_year,
amount_prod = amount / num_of_products,
cred_prod = credit_score / num_of_products,
bal_ten_year = balance / tenure_year,
prod_m_cr = num_of_products - has_cr_card,
prod_t_cr = num_of_products * has_cr_card)
}
catcor <- function(x, type=c("cramer", "phi", "contingency")) {
require(vcd)
nc <- ncol(x)
v <- expand.grid(1:nc, 1:nc)
type <- match.arg(type)
res <- matrix(mapply(function(i1, i2) assocstats(table(x[,i1],
x[,i2]))[[type]], v[,1], v[,2]), nc, nc)
rownames(res) <- colnames(res) <- colnames(x)
res
}
minmax_scaler <- function(x) {
return( ( x - min( x ) ) / ( max(x) - min(x) ) )
}
robust_scaler <- function(x){
return( ( x - quantile( x , 0.5) ) / ( quantile(x ,0.75) - quantile(x, 0.25) ) )
}
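As a quick sanity check, here is what the two scalers return on a small toy vector (definitions repeated inline so the snippet runs on its own; values use R's default quantile type):

```r
# Same definitions as above, repeated for self-containment
minmax_scaler <- function(x) (x - min(x)) / (max(x) - min(x))
robust_scaler <- function(x) (x - quantile(x, 0.5)) / (quantile(x, 0.75) - quantile(x, 0.25))

x <- c(1, 2, 3, 4, 5)
minmax_scaler(x)  # 0.00 0.25 0.50 0.75 1.00
robust_scaler(x)  # -1.0 -0.5 0.0 0.5 1.0 (median 3, IQR = 4 - 2 = 2)
```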
# computes the precision for a set of predictions and tags it with the model name
ml_error <- function(model_name = "Logistic Regression Model", model_predictions){
model_predictions %>%
yardstick::precision(actual, predictions) %>%
dplyr::mutate(model = model_name)
}
data_raw <- data.table::fread("/home/renato/repos/churn/data/churn.xls")
data_raw <- data_raw %>% janitor::clean_names()
head(data_raw)
row_number customer_id surname credit_score geography gender age tenure
1: 1 15634602 Hargrave 619 France Female 42 2
2: 2 15647311 Hill 608 Spain Female 41 1
3: 3 15619304 Onio 502 France Female 42 8
4: 4 15701354 Boni 699 France Female 39 1
5: 5 15737888 Mitchell 850 Spain Female 43 2
6: 6 15574012 Chu 645 Spain Male 44 8
balance num_of_products has_cr_card is_active_member estimated_salary
1: 0.00 1 1 1 101348.88
2: 83807.86 1 0 1 112542.58
3: 159660.80 3 1 0 113931.57
4: 0.00 2 0 0 93826.63
5: 125510.82 1 1 1 79084.10
6: 113755.78 2 1 0 149756.71
exited
1: 1
2: 0
3: 1
4: 0
5: 0
6: 1
RowNumber - The number of the row.
CustomerID - Customer's unique identifier.
Surname - Customer's surname.
CreditScore - Customer's credit score for the consumer market.
Geography - The country where the customer lives.
Gender - Customer's gender.
Age - Customer's age.
Tenure - Number of years that the customer has been active.
Balance - The amount the customer holds in the bank account.
NumOfProducts - The number of products bought by the customer.
HasCrCard - Flag that indicates whether the customer has a credit card.
IsActiveMember - Flag that indicates whether the customer performed a banking activity in the last 12 months.
EstimatedSalary - Estimated monthly income of the customer.
Exited - Flag that indicates whether the customer churned (cancelled the contract).
print(paste("Number of Rows: " ,nrow(data_raw)))
[1] "Number of Rows: 10000"
print(paste("Number of Cols: " ,ncol(data_raw)))
[1] "Number of Cols: 14"
glimpse(data_raw)
Rows: 10,000
Columns: 14
$ row_number <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
$ customer_id <int> 15634602, 15647311, 15619304, 15701354, 15737888, 15…
$ surname <chr> "Hargrave", "Hill", "Onio", "Boni", "Mitchell", "Chu…
$ credit_score <int> 619, 608, 502, 699, 850, 645, 822, 376, 501, 684, 52…
$ geography <chr> "France", "Spain", "France", "France", "Spain", "Spa…
$ gender <chr> "Female", "Female", "Female", "Female", "Female", "M…
$ age <int> 42, 41, 42, 39, 43, 44, 50, 29, 44, 27, 31, 24, 34, …
$ tenure <int> 2, 1, 8, 1, 2, 8, 7, 4, 4, 2, 6, 3, 10, 5, 7, 3, 1, …
$ balance <dbl> 0.00, 83807.86, 159660.80, 0.00, 125510.82, 113755.7…
$ num_of_products <int> 1, 1, 3, 2, 1, 2, 2, 4, 2, 1, 2, 2, 2, 2, 2, 2, 1, 2…
$ has_cr_card <int> 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1…
$ is_active_member <int> 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1…
$ estimated_salary <dbl> 101348.88, 112542.58, 113931.57, 93826.63, 79084.10,…
$ exited <int> 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0…
colSums(is.na(data_raw))
row_number customer_id surname credit_score
0 0 0 0
geography gender age tenure
0 0 0 0
balance num_of_products has_cr_card is_active_member
0 0 0 0
estimated_salary exited
0 0
data_raw <- data_raw %>%
dplyr::mutate_if(is.character, as.factor)
# selecting only numeric features
num_attributes <- data_raw %>%
purrr::keep(is.numeric)
# selecting only categorical features
cat_attributes <- data_raw %>%
purrr::keep(is.factor)
# Central Tendency - mean , median
num_mean <- as.data.frame( t(lapply(num_attributes, mean)))
num_median <- as.data.frame( t(lapply(num_attributes, median)))
# dispersion - std, min, max, range, skew, kurtosis
num_std <- as.data.frame( t(lapply(num_attributes, sd)))
num_min <- as.data.frame( t(lapply(num_attributes, min)))
num_max <- as.data.frame( t(lapply(num_attributes, max)))
num_skew <- as.data.frame( t(lapply(num_attributes, e1071::skewness)))
num_kurt <- as.data.frame( t(lapply(num_attributes, e1071::kurtosis)))
table_desc <- t(bind_rows(num_min,num_max,num_mean,num_median,num_std,num_skew,num_kurt))
table_desc<- as.data.frame(table_desc)
names(table_desc) <- c("min","max","mean","median","std","skew", "kurtosis")
knitr::kable(table_desc, digits = 4) %>%
kableExtra::kable_styling(full_width = F, bootstrap_options = c("striped", "hover", "condensed", "responsive"), html_font = "Cambria")
| | min | max | mean | median | std | skew | kurtosis |
|---|---|---|---|---|---|---|---|
| row_number | 1 | 10000 | 5000.5 | 5000.5 | 2886.896 | 0 | -1.20036 |
| customer_id | 15565701 | 15815690 | 15690941 | 15690738 | 71936.19 | 0.001148801 | -1.196475 |
| credit_score | 350 | 850 | 650.5288 | 652 | 96.6533 | -0.07158513 | -0.4266275 |
| age | 18 | 92 | 38.9218 | 37 | 10.48781 | 1.011017 | 1.393171 |
| tenure | 0 | 10 | 5.0128 | 5 | 2.892174 | 0.01098816 | -1.16561 |
| balance | 0 | 250898.1 | 76485.89 | 97198.54 | 62397.41 | -0.1410664 | -1.489569 |
| num_of_products | 1 | 4 | 1.5302 | 1 | 0.5816544 | 0.7453442 | 0.581373 |
| has_cr_card | 0 | 1 | 0.7055 | 1 | 0.4558405 | -0.9015411 | -1.187342 |
| is_active_member | 0 | 1 | 0.5151 | 1 | 0.4997969 | -0.0604185 | -1.996549 |
| estimated_salary | 11.58 | 199992.5 | 100090.2 | 100193.9 | 57510.49 | 0.002084732 | -1.181891 |
| exited | 0 | 1 | 0.2037 | 0 | 0.4027686 | 1.471169 | 0.1643553 |
num_attributes %>%
purrr::keep(is.numeric) %>%
tidyr::gather() %>%
ggplot2::ggplot(ggplot2::aes(value)) +
ggplot2::facet_wrap(~ key, scales = "free") +
ggplot2::geom_histogram(col= "black", fill="steelblue", bins = 25)+
ggplot2::scale_y_continuous(labels = function(x) format(x, scientific = FALSE))+
ggplot2::labs(title = "Distribution of numerical variables")+
ggplot2::scale_x_continuous(labels = function(x) format(x, scientific = FALSE))+
ggplot2::theme(plot.title = ggplot2::element_text(hjust = 0.5, size = 18))
data_raw %>%
mutate(exited = as.factor(exited)) %>%
dplyr::count(exited) %>%
mutate(prop = round(n/sum(n)*100,2)) %>%
ggplot2::ggplot(ggplot2::aes(x = exited, y = prop, color = exited)) +
ggsci::scale_color_jco() +
ggplot2::geom_segment(ggplot2::aes(xend = exited, yend = 0), show.legend = F) +
ggplot2::geom_point(ggplot2::aes(size = prop), show.legend = F) +
ggplot2::geom_label(ggplot2::aes(label = paste0(prop,"% / n = ",n)),
fill = "white",
hjust = "inward",
show.legend = F) +
ggplot2::labs(y = "%",
x = "Exited") +
ggplot2::coord_flip() +
ggplot2::theme_minimal()
The company's current churn rate is 20%.
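Since exited is a 0/1 flag, the rate above can be reproduced directly from its mean (a one-liner sketch, assuming data_raw as loaded earlier):

```r
# mean of a 0/1 flag = proportion of churners
round(mean(data_raw$exited) * 100, 2)  # 20.37
```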
apply(cat_attributes, 2, function(x) length(unique(x)))
surname geography gender
2932 3 2
ggpubr::ggarrange(
data_raw %>%
dplyr::count(gender) %>%
mutate(prop = round(n/sum(n)*100,2)) %>%
ggplot2::ggplot(ggplot2::aes(x = gender, y = prop, color = gender)) +
ggsci::scale_color_jco() +
ggplot2::geom_segment(ggplot2::aes(xend = gender, yend = 0), show.legend = F) +
ggplot2::geom_point(ggplot2::aes(size = prop), show.legend = F) +
ggplot2::geom_label(ggplot2::aes(label = paste0(prop,"% / n = ",n)),
fill = "white",
hjust = "inward",
show.legend = F) +
ggplot2::labs(y = "%",
x = "Gender", title = "Gender") +
ggplot2::coord_flip() +
ggplot2::theme_minimal()+
ggplot2::theme(plot.title = ggplot2::element_text(hjust = 0.5, size = 18)),
data_raw %>%
dplyr::count(geography) %>%
mutate(prop = round(n/sum(n)*100,2)) %>%
ggplot2::ggplot(ggplot2::aes(x = geography, y = prop, color = geography)) +
ggsci::scale_color_jco() +
ggplot2::geom_segment(ggplot2::aes(xend = geography, yend = 0), show.legend = F) +
ggplot2::geom_point(ggplot2::aes(size = prop), show.legend = F) +
ggplot2::geom_label(ggplot2::aes(label = paste0(prop,"% / n = ",n)),
fill = "white",
hjust = "inward",
show.legend = F) +
ggplot2::labs(y = "%",
x = "Geography", title = "Geography") +
ggplot2::coord_flip() +
ggplot2::theme_minimal()+
ggplot2::theme(plot.title = ggplot2::element_text(hjust = 0.5, size = 18)),
ncol = 2)
Conclusions:
The average customer age is 38 years.
Balance is approximately normally distributed, with most customers holding between 100,000.00 and 130,000.00 euros in their bank account.
Most customers have a credit score between 400 and 700.
Customers have an average estimated monthly income of 100,000.00 euros.
Most customers have a credit card.
Customers who have and have not performed a banking activity in the last 12 months are almost evenly split.
Most customers own at least one bank product.
Customers have been active at the bank for an average of 5 years.
Only 20% of customers are churners.
55% of customers are male.
50% of customers are from France.
num_attributes %>%
filter(balance == 0) %>%
count()
n
1: 3617
knitr::include_graphics("/home/renato/repos/churn/img/Churn.png")
1 Customers with higher salaries should have a higher churn rate.
2 Customers with a low level of satisfaction are more likely to churn.
3 Customers with a low bank account balance are more likely to churn.
4 Younger customers should cancel services more often.
1 Male customers are more likely to churn.
1 The more active years a customer has, the lower the churn risk.
1 The churn rate should be higher for customers from France.
1 Customers with only one product should have higher churn.
1 Customers with higher salaries should have a higher churn rate.
2 Customers with a low bank account balance are more likely to churn.
3 Younger customers should cancel services more often.
4 Male customers are more likely to churn.
5 The more active years a customer has, the lower the churn risk.
6 The churn rate should be higher for customers from France.
7 Customers with only one product should have higher churn.
df1 <- data_raw
head(df1) %>% knitr::kable() %>%
kableExtra::kable_styling(full_width = F, bootstrap_options = c("striped", "hover", "condensed", "responsive"), html_font = "Cambria")
| row_number | customer_id | surname | credit_score | geography | gender | age | tenure | balance | num_of_products | has_cr_card | is_active_member | estimated_salary | exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.00 | 1 | 1 | 1 | 101348.88 | 1 |
| 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 3 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.80 | 3 | 1 | 0 | 113931.57 | 1 |
| 4 | 15701354 | Boni | 699 | France | Female | 39 | 1 | 0.00 | 2 | 0 | 0 | 93826.63 | 0 |
| 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.10 | 0 |
| 6 | 15574012 | Chu | 645 | Spain | Male | 44 | 8 | 113755.78 | 2 | 1 | 0 | 149756.71 | 1 |
# Create new features
df1 <- feature_engineering(df1)
# Visualize the new features
t(head(df1)) %>% knitr::kable() %>%
kableExtra::kable_styling(full_width = F, bootstrap_options = c("striped", "hover", "condensed", "responsive"), html_font = "Cambria")
| | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| row_number | 1 | 2 | 3 | 4 | 5 | 6 |
| customer_id | 15634602 | 15647311 | 15619304 | 15701354 | 15737888 | 15574012 |
| surname | Hargrave | Hill | Onio | Boni | Mitchell | Chu |
| credit_score | 619 | 608 | 502 | 699 | 850 | 645 |
| geography | France | Spain | France | France | Spain | Spain |
| gender | Female | Female | Female | Female | Female | Male |
| age | 42 | 41 | 42 | 39 | 43 | 44 |
| tenure | 2 | 1 | 8 | 1 | 2 | 8 |
| balance | 0.00 | 83807.86 | 159660.80 | 0.00 | 125510.82 | 113755.78 |
| num_of_products | 1 | 1 | 3 | 2 | 1 | 2 |
| has_cr_card | 1 | 0 | 1 | 0 | 1 | 1 |
| is_active_member | 1 | 1 | 0 | 0 | 1 | 0 |
| estimated_salary | 101348.88 | 112542.58 | 113931.57 | 93826.63 | 79084.10 | 149756.71 |
| exited | 1 | 0 | 1 | 0 | 0 | 1 |
| tenure_year | 3 | 2 | 9 | 2 | 3 | 9 |
| age_ten_year | 14.000000 | 20.500000 | 4.666667 | 19.500000 | 14.333333 | 4.888889 |
| cred_ten_year | 206.33333 | 304.00000 | 55.77778 | 349.50000 | 283.33333 | 71.66667 |
| cred_age | 14.73810 | 14.82927 | 11.95238 | 17.92308 | 19.76744 | 14.65909 |
| amount | 101348.88 | 196350.44 | 273592.37 | 93826.63 | 204594.92 | 263512.49 |
| amount_credit | 163.7300 | 322.9448 | 545.0047 | 134.2298 | 240.6999 | 408.5465 |
| amount_ten_year | 33782.96 | 98175.22 | 30399.15 | 46913.32 | 68198.31 | 29279.17 |
| amount_prod | 101348.88 | 196350.44 | 91197.46 | 46913.32 | 204594.92 | 131756.24 |
| cred_prod | 619.0000 | 608.0000 | 167.3333 | 349.5000 | 850.0000 | 322.5000 |
| bal_ten_year | 0.00 | 41903.93 | 17740.09 | 0.00 | 41836.94 | 12639.53 |
| prod_m_cr | 0 | 1 | 2 | 2 | 0 | 1 |
| prod_t_cr | 1 | 0 | 3 | 0 | 1 | 2 |
df2 <- df1
df2 %>%
mutate(exited = as.factor(exited)) %>%
dplyr::count(exited) %>%
mutate(prop = round(n/sum(n)*100,2)) %>%
ggplot2::ggplot(ggplot2::aes(x = exited, y = prop, color = exited)) +
ggsci::scale_color_jco() +
ggplot2::geom_segment(ggplot2::aes(xend = exited, yend = 0), show.legend = F) +
ggplot2::geom_point(ggplot2::aes(size = prop), show.legend = F) +
ggplot2::geom_label(ggplot2::aes(label = paste0(prop,"% / n = ",n)),
fill = "white",
hjust = "inward",
show.legend = F) +
ggplot2::labs(y = "%",
x = "Exited", title = "Distribution Exited") +
ggplot2::coord_flip() +
ggplot2::theme_minimal()+
ggplot2::theme(plot.title = ggplot2::element_text(hjust = 0.5, size = 18))
df2 %>%
ggplot(aes(exited))+
geom_histogram(fill= "steelblue",col="black")+
scale_x_continuous(breaks = seq(0,1))
num_attributes %>%
purrr::keep(is.numeric) %>%
tidyr::gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram(col= "black", fill="steelblue", bins = 25)+
scale_y_continuous(labels = function(x) format(x, scientific = FALSE))+
scale_x_continuous(labels = function(x) format(x, scientific = FALSE))+
labs(title = "Distribution of numerical variables")+
theme(plot.title = element_text(hjust = 0.5, size = 18))
ggpubr::ggarrange(
df2 %>%
dplyr::count(gender) %>%
mutate(prop = round(n/sum(n)*100,2)) %>%
ggplot2::ggplot(ggplot2::aes(x = gender, y = prop, color = gender)) +
ggsci::scale_color_jco() +
ggplot2::geom_segment(ggplot2::aes(xend = gender, yend = 0), show.legend = F) +
ggplot2::geom_point(ggplot2::aes(size = prop), show.legend = F) +
ggplot2::geom_label(ggplot2::aes(label = paste0(prop,"% / n = ",n)),
fill = "white",
hjust = "inward",
show.legend = F) +
ggplot2::labs(y = "%",
x = "Gender", title = "Gender") +
ggplot2::coord_flip() +
ggplot2::theme_minimal()+
ggplot2::theme(plot.title = ggplot2::element_text(hjust = 0.5, size = 18)),
df2 %>%
dplyr::count(geography) %>%
mutate(prop = round(n/sum(n)*100,2)) %>%
ggplot2::ggplot(ggplot2::aes(x = geography, y = prop, color = geography)) +
ggsci::scale_color_jco() +
ggplot2::geom_segment(ggplot2::aes(xend = geography, yend = 0), show.legend = F) +
ggplot2::geom_point(ggplot2::aes(size = prop), show.legend = F) +
ggplot2::geom_label(ggplot2::aes(label = paste0(prop,"% / n = ",n)),
fill = "white",
hjust = "inward",
show.legend = F) +
ggplot2::labs(y = "%",
x = "Geography", title = "Geography") +
ggplot2::coord_flip() +
ggplot2::theme_minimal()+
ggplot2::theme(plot.title = ggplot2::element_text(hjust = 0.5, size = 18)),
ncol = 2)
df2 %>%
group_by(tenure) %>%
summarise(exited = sum(exited), .groups = "drop") %>%
highcharter::hchart(
'line', highcharter::hcaes(x = tenure, y = exited),
color = "steelblue"
)
round(cor(df2$tenure, df2$exited),2)
[1] -0.01
Weak and negative correlation.
The highest churn count occurs in the first year, followed by the ninth year.
H1 Customers with higher salaries should have a higher churn rate.
True: customers with high salaries, between 140,000.00 and 200,000.00 euros, cancel more contracts.
fig.2 <- df2 %>%
mutate(exited = as.factor(exited)) %>%
ggplot(aes(estimated_salary, fill = exited))+
geom_density(alpha = 0.3)+
labs(title = "Distribution Estimated Salary / Exited")+
ggplot2::theme(plot.title = ggplot2::element_text(hjust = 0.5, size = 18))+
scale_x_continuous(breaks = seq(0,200000,10000))+
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
ggplotly(fig.2) %>% layout(autosize = F, width = 900, height = 450)
round(cor(df2$estimated_salary, df2$exited),2)
[1] 0.01
Weak and positive correlation.
H2 Customers with a low bank account balance are more likely to churn.
False: customers with higher account balances cancel more contracts; this is visible from about 80,000.00 euros upward.
fig.3 <- df2 %>%
mutate(exited = as.factor(exited)) %>%
ggplot(aes(balance, fill= exited))+
geom_density(alpha= 0.3)+
labs(title = "Distribution Balance / Exited")+
ggplot2::theme(plot.title = ggplot2::element_text(hjust = 0.5, size = 18))+
scale_x_continuous(breaks = seq(0,250000,20000))
ggplotly(fig.3) %>% layout(autosize = F, width = 900, height = 450)
round(cor(df2$balance, df2$exited),2)
[1] 0.12
Weak and positive correlation.
H3 Younger customers should cancel services more often.
False: customers start cancelling contracts from age 42 onward.
fig.4 <- df2 %>%
mutate(exited = as.factor(exited)) %>%
ggplot(aes(age, fill= exited))+
geom_density(alpha= 0.3)+
labs(title = "Distribution Age / Exited")+
ggplot2::theme(plot.title = ggplot2::element_text(hjust = 0.5, size = 18))+
scale_x_continuous(breaks = seq(0,80,3))
ggplotly(fig.4) %>% layout(autosize = F, width = 900, height = 450)
round(cor(df2$age, df2$exited),2)
[1] 0.29
Weak and positive correlation.
H4 Male customers are more likely to churn.
False: male churners account for about 9% of the 10,000 customers, versus about 11% for female churners.
fig.5 <- df2 %>%
mutate(exited = as.factor(exited)) %>%
ggplot(aes(gender, fill= exited))+
geom_bar(col = "black")+
labs(title = "Distribution Gender / Exited")+
ggplot2::theme(plot.title = ggplot2::element_text(hjust = 0.5, size = 18))
ggplotly(fig.5) %>% layout(autosize = F, width = 900, height = 450)
Churn counts by gender.
table(df2$gender, df2$exited)
0 1
Female 3404 1139
Male 4559 898
Proportion of churn by gender, over all customers (%).
prop.table(table(df2$gender, df2$exited))*100
0 1
Female 34.04 11.39
Male 45.59 8.98
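The table above gives proportions over all 10,000 customers; a complementary view (a sketch, assuming df2 as built above) is the churn rate within each gender, obtained with margin = 1 for row-wise proportions:

```r
# churn rate within each gender: each row sums to 100%
round(prop.table(table(df2$gender, df2$exited), margin = 1) * 100, 2)
#            0     1
# Female 74.93 25.07
# Male   83.54 16.46
```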
H5 The more active years a customer has, the lower the churn risk.
False: over the years, the churn rate remains roughly constant.
fig.6 <- df2 %>%
mutate(exited = as.factor(exited),
tenure = as.factor(tenure)) %>%
ggplot(aes(tenure, fill= exited))+
geom_bar(col = "black")+
labs(title = "Distribution Tenure / Exited")+
ggplot2::theme(plot.title = ggplot2::element_text(hjust = 0.5, size = 18))
ggplotly(fig.6) %>% layout(autosize = F, width = 900, height = 450)
Proportion of churn by years of tenure, over all customers (%).
(prop.table(table(df2$tenure, df2$exited))*100)
0 1
0 3.18 0.95
1 8.03 2.32
2 8.47 2.01
3 7.96 2.13
4 7.86 2.03
5 8.03 2.09
6 7.71 1.96
7 8.51 1.77
8 8.28 1.97
9 7.71 2.13
10 3.89 1.01
H6 The churn rate should be higher for customers from France.
False: Germany has the highest churn rate, although the difference from France is small.
fig.7 <- df2 %>%
mutate(exited = as.factor(exited)
) %>%
ggplot(aes(geography, fill= exited))+
geom_bar(col = "black")+
labs(title = "Distribution Geography / Exited")+
ggplot2::theme(plot.title = ggplot2::element_text(hjust = 0.5, size = 18))
ggplotly(fig.7) %>% layout(autosize = F, width = 900, height = 450)
Proportion of churn by country, over all customers (%).
prop.table(table(df2$geography, df2$exited))*100
0 1
France 42.04 8.10
Germany 16.95 8.14
Spain 20.64 4.13
H7 Customers with only one product should have higher churn.
True: customers with only one product have a higher churn rate.
fig.8 <- df2 %>%
mutate(exited = as.factor(exited)
) %>%
ggplot(aes(num_of_products, fill= exited))+
geom_bar(col = "black")+
labs(title = "Distribution Number of Products / Exited")+
ggplot2::theme(plot.title = ggplot2::element_text(hjust = 0.5, size = 18))
ggplotly(fig.8) %>% layout(autosize = F, width = 900, height = 450)
round(cor(df2$num_of_products, df2$exited),2)
[1] -0.05
Weak and negative correlation.
df2 %>%
purrr::keep(is.numeric) %>%
cor() %>%
ggcorrplot::ggcorrplot(hc.order = T,
type = "lower",
lab=T,
lab_size = 3,
method = "square",
colors = c("chocolate1","white","darkcyan"),
ggtheme = theme_minimal())
df2 %>%
purrr::keep(is.factor) %>%
as.data.frame() %>%
catcor(type="cramer") %>%
ggcorrplot::ggcorrplot(hc.order = T,
type = "lower",
lab=T,
lab_size = 3,
method = "square",
colors = c("chocolate1","white","steelblue"),
ggtheme = theme_minimal())
df2 <- df2 %>% mutate(exited = ifelse(exited == 1, "yes","no"),
exited= as.factor(exited))
set.seed(1234)
# data split: 80% for training, 20% for testing, stratified by exited
data_split <- rsample::initial_split(df2, prop = 0.8, strata = exited)
# getting dataset of training
train_data <- rsample::training(data_split)
# getting dataset of testing
test_data <- rsample::testing(data_split)
# Cross-validation
cv <- rsample::vfold_cv(train_data, strata = exited)
Statistical summary
# list of integer-type features
# "credit_score" <- integer
# "age" <- integer
# "tenure" <- integer
# "num_of_products" <- integer
# "has_cr_card" <- integer
# "is_active_member" <- integer
# "prod_m_cr" <- integer
# "prod_t_cr" <- integer
df2 %>%
purrr::keep(is.integer) %>% select(-row_number, -customer_id) %>% summary()
credit_score age tenure num_of_products
Min. :350.0 Min. :18.00 Min. : 0.000 Min. :1.00
1st Qu.:584.0 1st Qu.:32.00 1st Qu.: 3.000 1st Qu.:1.00
Median :652.0 Median :37.00 Median : 5.000 Median :1.00
Mean :650.5 Mean :38.92 Mean : 5.013 Mean :1.53
3rd Qu.:718.0 3rd Qu.:44.00 3rd Qu.: 7.000 3rd Qu.:2.00
Max. :850.0 Max. :92.00 Max. :10.000 Max. :4.00
has_cr_card is_active_member prod_m_cr prod_t_cr
Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00
1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00
Median :1.0000 Median :1.0000 Median :1.0000 Median :1.00
Mean :0.7055 Mean :0.5151 Mean :0.8247 Mean :1.08
3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:2.00
Max. :1.0000 Max. :1.0000 Max. :4.0000 Max. :4.00
Checking for outliers
df2 %>%
purrr::keep(is.integer) %>% select(-row_number, -customer_id, -is_active_member, - has_cr_card, -num_of_products) %>%
tidyr::gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_boxplot(col = "black", fill = "steelblue")+
scale_y_continuous(labels = function(x) format(x, scientific = FALSE))+
labs(title = "Distribution of integer variables")+
theme(plot.title = element_text(hjust = 0.5, size = 18))
Statistical summary
# list of double-type features
# "balance" <- double
# "estimated_salary" <- double
# "tenure_year" <- double
# "age_ten_year" <- double
# "cred_ten_year" <- double
# "cred_age" <- double
# "amount" <- double
# "amount_credit" <- double
# "amount_ten_year" <- double
# "amount_prod" <- double
# "cred_prod" <- double
# "bal_ten_year" <- double
df2 %>%
purrr::keep(is.double) %>% summary()
balance estimated_salary tenure_year age_ten_year
Min. : 0 Min. : 11.58 Min. : 1.000 Min. : 1.636
1st Qu.: 0 1st Qu.: 51002.11 1st Qu.: 4.000 1st Qu.: 4.400
Median : 97199 Median :100193.91 Median : 6.000 Median : 6.429
Mean : 76486 Mean :100090.24 Mean : 6.013 Mean : 9.456
3rd Qu.:127644 3rd Qu.:149388.25 3rd Qu.: 8.000 3rd Qu.:11.000
Max. :250898 Max. :199992.48 Max. :11.000 Max. :79.000
cred_ten_year cred_age amount amount_credit
Min. : 31.82 Min. : 4.857 Min. : 90.1 Min. : 0.1356
1st Qu.: 76.38 1st Qu.:14.089 1st Qu.:117726.8 1st Qu.:179.2684
Median :108.59 Median :17.286 Median :177123.0 Median :271.0889
Mean :157.59 Mean :17.874 Mean :176576.1 Mean :277.8788
3rd Qu.:186.00 3rd Qu.:20.962 3rd Qu.:241020.1 3rd Qu.:375.9885
Max. :850.00 Max. :46.889 Max. :407730.8 Max. :984.7171
amount_ten_year amount_prod cred_prod bal_ten_year
Min. : 10.7 Min. : 45 Min. : 94.0 Min. : 0
1st Qu.: 17674.0 1st Qu.: 68215 1st Qu.:323.0 1st Qu.: 0
Median : 29519.9 Median :122878 Median :443.0 Median : 13281
Mean : 42998.4 Mean :137535 Mean :486.4 Mean : 18731
3rd Qu.: 50322.6 3rd Qu.:199910 3rd Qu.:653.0 3rd Qu.: 24638
Max. :373236.3 Max. :400348 Max. :850.0 Max. :197042
Checking for outliers
df2 %>%
purrr::keep(is.double)%>%
tidyr::gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_boxplot(col = "black", fill = "steelblue")+
scale_y_continuous(labels = function(x) format(x, scientific = FALSE))+
scale_x_continuous(labels = function(x) format(x, scientific = FALSE))+
labs(title = "Distribution of double variables")+
theme(plot.title = element_text(hjust = 0.5, size = 18))
# "surname" <- factor
# "geography" <- factor
# "gender" <- factor
# "exited" <- factor
df2 %>% purrr::keep(is.factor)%>% summary()
surname geography gender exited
Smith : 32 France :5014 Female:4543 no :7963
Martin : 29 Germany:2509 Male :5457 yes:2037
Scott : 29 Spain :2477
Walker : 28
Brown : 26
Genovese: 25
(Other) :9831
The rescaling methods applied below are chosen from each feature's distribution shape and the boxplot outlier analysis: the robust scaler for features flagged with outliers, min-max scaling otherwise.
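The rule can be sketched as a small helper (hypothetical, not part of the original pipeline; the recipes below apply the same choice feature by feature):

```r
# Hypothetical helper: choose the scaler from the outlier flag in table_rescaling.
# (minmax_scaler / robust_scaler as defined earlier, repeated for self-containment)
minmax_scaler <- function(x) (x - min(x)) / (max(x) - min(x))
robust_scaler <- function(x) (x - quantile(x, 0.5)) / (quantile(x, 0.75) - quantile(x, 0.25))

scale_by_outlier <- function(x, action) {
  if (action == "outlier") robust_scaler(x) else minmax_scaler(x)
}
# e.g. age is flagged as "outlier", so it gets the robust scaler
```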
table_rescaling <- tibble(age = "outlier" , credit_score = "outlier" , prod_m_cr = "outlier" ,
prod_t_cr = "no outlier",tenure = "no outlier", age_ten_year = "outlier",
amount = "no outlier", amount_credit = "outlier",amount_prod = "outlier",
amount_ten_year = "outlier", bal_ten_year = "outlier" , balance = "no outlier",
cred_age = "outlier" , cred_prod = "no outlier",cred_ten_year = "outlier",
estimated_salary = "no outlier" ,tenure_year = "no outlier")
table_rescaling <- table_rescaling %>% tidyr::gather() %>% rename(features = key,
actions = value)
knitr::kable(table_rescaling) %>%
kableExtra::kable_styling(full_width = F, bootstrap_options = c("striped", "hover", "condensed", "responsive"), html_font = "Cambria")
| features | actions |
|---|---|
| age | outlier |
| credit_score | outlier |
| prod_m_cr | outlier |
| prod_t_cr | no outlier |
| tenure | no outlier |
| age_ten_year | outlier |
| amount | no outlier |
| amount_credit | outlier |
| amount_prod | outlier |
| amount_ten_year | outlier |
| bal_ten_year | outlier |
| balance | no outlier |
| cred_age | outlier |
| cred_prod | no outlier |
| cred_ten_year | outlier |
| estimated_salary | no outlier |
| tenure_year | no outlier |
rec_imbalanced <- recipes::recipe(exited ~. , train_data %>% select(-surname)) %>%
# removing row_number and customer_id (surname was already dropped in the select above)
recipes::step_rm(row_number, customer_id) %>%
# normalizing features
recipes::step_mutate(age = robust_scaler(age)) %>%
recipes::step_mutate(credit_score = robust_scaler(credit_score)) %>%
recipes::step_mutate(prod_m_cr = robust_scaler(prod_m_cr)) %>%
recipes::step_mutate(prod_t_cr= minmax_scaler(prod_t_cr)) %>%
recipes::step_mutate(tenure = minmax_scaler(tenure)) %>%
recipes::step_mutate(age_ten_year = robust_scaler(age_ten_year)) %>%
recipes::step_mutate(amount = minmax_scaler(amount)) %>%
recipes::step_mutate(amount_credit = robust_scaler(amount_credit)) %>%
recipes::step_mutate(amount_prod = robust_scaler(amount_prod)) %>%
recipes::step_mutate(amount_ten_year = robust_scaler(amount_ten_year)) %>%
recipes::step_mutate(bal_ten_year = robust_scaler(bal_ten_year)) %>%
recipes::step_mutate(balance = minmax_scaler(balance)) %>%
recipes::step_mutate(cred_age = robust_scaler(cred_age)) %>%
recipes::step_mutate(cred_prod = minmax_scaler(cred_prod)) %>%
recipes::step_mutate(cred_ten_year = robust_scaler(cred_ten_year)) %>%
recipes::step_mutate(estimated_salary = minmax_scaler(estimated_salary)) %>%
recipes::step_mutate(tenure_year = minmax_scaler(tenure_year)) %>%
# turning categorical features into numerics.
recipes::step_dummy(recipes::all_nominal(),- recipes::all_outcomes(), one_hot = T)
rec_balanced <- recipes::recipe(exited ~. , train_data %>% select(-surname)) %>%
# Removing the identifier features row_number and customer_id (surname was already dropped in the select() above).
recipes::step_rm(row_number, customer_id) %>%
# normalizing features
recipes::step_mutate(age = robust_scaler(age)) %>%
recipes::step_mutate(credit_score = robust_scaler(credit_score)) %>%
recipes::step_mutate(prod_m_cr = robust_scaler(prod_m_cr)) %>%
recipes::step_mutate(prod_t_cr= minmax_scaler(prod_t_cr)) %>%
recipes::step_mutate(tenure = minmax_scaler(tenure)) %>%
recipes::step_mutate(age_ten_year = robust_scaler(age_ten_year)) %>%
recipes::step_mutate(amount = minmax_scaler(amount)) %>%
recipes::step_mutate(amount_credit = robust_scaler(amount_credit)) %>%
recipes::step_mutate(amount_prod = robust_scaler(amount_prod)) %>%
recipes::step_mutate(amount_ten_year = robust_scaler(amount_ten_year)) %>%
recipes::step_mutate(bal_ten_year = robust_scaler(bal_ten_year)) %>%
recipes::step_mutate(balance = minmax_scaler(balance)) %>%
recipes::step_mutate(cred_age = robust_scaler(cred_age)) %>%
recipes::step_mutate(cred_prod = minmax_scaler(cred_prod)) %>%
recipes::step_mutate(cred_ten_year = robust_scaler(cred_ten_year)) %>%
recipes::step_mutate(estimated_salary = minmax_scaler(estimated_salary)) %>%
recipes::step_mutate(tenure_year = minmax_scaler(tenure_year)) %>%
# turning categorical features into numerics.
recipes::step_dummy(recipes::all_nominal(),- recipes::all_outcomes(), one_hot = T) %>%
themis::step_smote(exited)
# Imbalanced, pre-processed training data
train_data_imbalanced <- rec_imbalanced %>% recipes::prep(train_data) %>% recipes::juice()
# Imbalanced, pre-processed testing data: bake with the recipe prepped on the training data to avoid leakage
test_data_imbalanced <- rec_imbalanced %>% recipes::prep(train_data) %>% recipes::bake(new_data = test_data)
# Balanced, pre-processed training data (step_smote is skipped at bake time, so only the training set is oversampled)
train_data_balanced <- rec_balanced %>% recipes::prep(train_data) %>% recipes::juice()
# Balanced, pre-processed testing data
test_data_balanced <- rec_balanced %>% recipes::prep(train_data) %>% recipes::bake(new_data = test_data)
ggpubr::ggarrange(
train_data_imbalanced %>%
ggplot(aes(exited, fill= exited))+
geom_bar(col="black")+
labs(title = "Imbalanced Data")+
theme(plot.title = element_text(hjust = 0.5, size = 18)),
train_data_balanced %>%
ggplot(aes(exited, fill= exited))+
geom_bar(col="black")+
labs(title = "Balanced Data")+
theme(plot.title = element_text(hjust = 0.5, size = 18)) , ncol = 2)
train_data_imbalanced %>%
purrr::keep(is.numeric) %>%
tidyr::gather() %>%
ggplot2::ggplot(ggplot2::aes(value)) +
ggplot2::facet_wrap(~ key, scales = "free") +
ggplot2::geom_histogram(col= "black", fill="steelblue", bins = 25)+
ggplot2::scale_y_continuous(labels = function(x) format(x, scientific = FALSE))+
ggplot2::labs(title = "Distribution of numeric variables after pre-processing")+
ggplot2::scale_x_continuous(labels = function(x) format(x, scientific = FALSE))+
ggplot2::theme(plot.title = ggplot2::element_text(hjust = 0.5, size = 18))
#boruta <- Boruta::Boruta(exited ~.,data = train_data_balanced %>% select(-row_number,-customer_id) ,doTrace =2 )
#saveRDS(boruta, "Boruta/boruta.rds")
boruta <- readRDS("Boruta/boruta.rds")
features_boruta <- Boruta::attStats(boruta) %>% arrange(desc(meanImp))
features_boruta <- as.data.frame(data.table::setDT(features_boruta, keep.rownames = "features"))
features_boruta
features meanImp medianImp minImp maxImp normHits decision
1 age 75.62481 76.27921 71.55115 79.76610 1 Confirmed
2 is_active_member 58.01243 57.49542 56.77516 59.63506 1 Confirmed
3 cred_age 44.24553 44.69362 42.19361 45.38396 1 Confirmed
4 num_of_products 41.06186 41.63474 38.39232 42.39871 1 Confirmed
5 balance 39.06594 38.80633 37.04052 42.26464 1 Confirmed
6 cred_ten_year 38.18778 38.19954 35.61786 41.10785 1 Confirmed
7 amount_ten_year 37.15793 36.78392 34.18727 41.11690 1 Confirmed
8 age_ten_year 36.34803 36.42665 35.15741 37.48979 1 Confirmed
9 credit_score 34.86733 34.96885 31.92939 38.70913 1 Confirmed
10 bal_ten_year 34.69190 34.57109 33.07699 36.42140 1 Confirmed
11 estimated_salary 34.44135 34.42287 30.47675 38.42563 1 Confirmed
12 amount_credit 34.10562 34.03872 31.38059 37.15669 1 Confirmed
13 cred_prod 33.19357 33.10060 32.07205 35.25670 1 Confirmed
14 geography_Germany 31.53729 31.28621 29.57731 33.43443 1 Confirmed
15 tenure 30.14478 30.19744 27.83968 31.97109 1 Confirmed
16 tenure_year 29.79353 29.77035 27.55277 31.81791 1 Confirmed
17 amount_prod 29.49149 29.68004 27.39199 30.97616 1 Confirmed
18 amount 28.14608 28.03218 25.01640 32.38418 1 Confirmed
19 prod_m_cr 24.79515 24.76330 22.69695 26.57305 1 Confirmed
20 gender_Male 22.96577 22.69095 22.07488 24.49463 1 Confirmed
21 gender_Female 22.87308 22.92502 21.23916 24.94434 1 Confirmed
22 prod_t_cr 22.08248 22.18046 20.81962 22.90126 1 Confirmed
23 geography_France 18.53242 18.69797 16.66298 20.28926 1 Confirmed
24 geography_Spain 17.98637 17.96556 17.13462 18.80484 1 Confirmed
25 has_cr_card 12.85795 13.03965 11.54596 14.55708 1 Confirmed
features_boruta %>%
ggplot(aes(stats::reorder (features ,desc(meanImp)), meanImp))+
geom_bar(stat="identity", fill="steelblue", col="black")+
labs(title = "Importance of Boruta Features", x= "features")+
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))+
ggplot2::theme(plot.title = ggplot2::element_text(hjust = 0.5, size = 18))
# Creating model
lr <- logistic_reg() %>%
set_engine("glm")
# Training model
lr_fit <- lr %>% fit(exited ~., data= train_data_imbalanced)
lr_wf <- workflow() %>%
add_model(lr) %>%
add_recipe(rec_imbalanced)
lr_res <-
last_fit(
lr_wf,
split = data_split,
metrics = metric_set(accuracy, precision ,recall, f_meas , roc_auc ,mcc, kap)
)
lr_result <- data.frame(t(collect_metrics(lr_res) %>% select(.metric, .estimate)))
rownames(lr_result) <- NULL
colnames(lr_result) <- lr_result[1,]
lr_result <- lr_result[-1, ]
lr_result <- data.frame(Model = "Logistic Regression - Imbalanced", lr_result)
lr_result %>% knitr::kable(caption = "Table Metrics Single - Imbalanced") %>%
kableExtra::kable_styling(full_width = F, bootstrap_options = c("hover", "condensed"),html_font = "Cambria")
| Model | accuracy | precision | recall | f_meas | mcc | kap | roc_auc |
|---|---|---|---|---|---|---|---|
| Logistic Regression - Imbalanced | 0.8444222 | 0.8460292 | 0.9836683 | 0.9096718 | 0.4358823 | 0.3713792 | 0.8296658 |
#lr_fit_cv <- lr_wf %>%
# fit_resamples(cv, metrics=metric_set(accuracy, precision ,recall, f_meas , roc_auc ,mcc, kap),
# control = control_resamples(save_pred = T))
#saveRDS(lr_fit_cv, "Models/lr_fit_cv.rds")
lr_fit_cv <- readRDS("Models/lr_fit_cv.rds")
lr_result_cv <- data.frame(t(collect_metrics(lr_fit_cv) %>% select(.metric, mean)))
rownames(lr_result_cv) <- NULL
colnames(lr_result_cv) <- lr_result_cv[1,]
lr_result_cv <- lr_result_cv[-1, ]
lr_result_cv <- data.frame(Model = "Logistic Regression - Imbalanced", lr_result_cv)
lr_result_cv %>% knitr::kable(caption = "Table Metrics Cross Validation - Imbalanced") %>%
kableExtra::kable_styling(full_width = F, bootstrap_options = c("hover", "condensed"),html_font = "Cambria")
| Model | accuracy | f_meas | kap | mcc | precision | recall | roc_auc |
|---|---|---|---|---|---|---|---|
| Logistic Regression - Imbalanced | 0.8315229 | 0.9023105 | 0.3140080 | 0.3738247 | 0.8384426 | 0.9769265 | 0.8158259 |
lr_fit_cv %>%
unnest(.predictions) %>%
conf_mat(exited, .pred_class) %>%
autoplot(type = "heatmap")
# Creating model
rf <- rand_forest(trees = 500) %>%
set_engine("ranger") %>%
set_mode("classification")
# Training model
#rf_fit <- rf %>% fit(exited ~., data= train_data_imbalanced)
#saveRDS(rf_fit, "Models/rf_fit.rds")
rf_fit <- readRDS("Models/rf_fit.rds")
rf_wf <- workflow() %>%
add_model(rf) %>%
add_recipe(rec_imbalanced)
# rf_res <-
# last_fit(
# rf_wf,
# split = data_split,
# metrics = metric_set(accuracy, precision ,recall, f_meas , roc_auc ,mcc, kap)
# )
#saveRDS(rf_res,"Models/rf_res.rds")
rf_res <- readRDS("Models/rf_res.rds")
rf_result <- data.frame(t(collect_metrics(rf_res) %>% select(.metric, .estimate)))
rownames(rf_result) <- NULL
colnames(rf_result) <- rf_result[1,]
rf_result <- rf_result[-1, ]
rf_result <- data.frame(Model = "Random Forest - Imbalanced", rf_result)
rf_result %>% knitr::kable(caption = "Table Metrics Single - Imbalanced") %>%
kableExtra::kable_styling(full_width = F, bootstrap_options = c("hover", "condensed"),html_font = "Cambria")
| Model | accuracy | precision | recall | f_meas | mcc | kap | roc_auc |
|---|---|---|---|---|---|---|---|
| Random Forest - Imbalanced | 0.8709355 | 0.8824541 | 0.9667085 | 0.9226619 | 0.5588921 | 0.5377695 | 0.8560678 |
# rf_fit_cv <- rf_wf %>%
# fit_resamples(cv, metrics=metric_set(accuracy, precision ,recall, f_meas , roc_auc ,mcc, kap),
# control = control_resamples(save_pred = T))
#saveRDS(rf_fit_cv, "Models/rf_fit_cv.rds")
rf_fit_cv <- readRDS("Models/rf_fit_cv.rds")
rf_result_cv <- data.frame(t(collect_metrics(rf_fit_cv) %>% select(.metric, mean)))
rownames(rf_result_cv) <- NULL
colnames(rf_result_cv) <- rf_result_cv[1,]
rf_result_cv <- rf_result_cv[-1, ]
rf_result_cv <- data.frame(Model = "Random Forest - Imbalanced", rf_result_cv)
rf_result_cv %>% knitr::kable(caption = "Table Metrics Cross Validation - Imbalanced") %>%
kableExtra::kable_styling(full_width = F, bootstrap_options = c("hover", "condensed"),html_font = "Cambria")
| Model | accuracy | f_meas | kap | mcc | precision | recall | roc_auc |
|---|---|---|---|---|---|---|---|
| Random Forest - Imbalanced | 0.8557690 | 0.9141315 | 0.4709430 | 0.4980182 | 0.8690588 | 0.9642139 | 0.8434486 |
rf_fit_cv %>%
unnest(.predictions) %>%
conf_mat(exited, .pred_class) %>%
autoplot(type = "heatmap")
# Creating model
lr_bal <- logistic_reg() %>%
set_engine("glm")
# Training model
lr_fit_bal <- lr_bal %>% fit(exited ~., data= train_data_balanced)
lr_wf_bal <- workflow() %>%
add_model(lr_bal) %>%
add_recipe(rec_balanced)
lr_res_bal <-
last_fit(
lr_wf_bal,
split = data_split,
metrics = metric_set(accuracy, precision ,recall, f_meas , roc_auc ,mcc, kap)
)
lr_result_bal <- data.frame(t(collect_metrics(lr_res_bal) %>% select(.metric, .estimate)))
rownames(lr_result_bal) <- NULL
colnames(lr_result_bal) <- lr_result_bal[1,]
lr_result_bal <- lr_result_bal[-1, ]
lr_result_bal <- data.frame(Model = "Logistic Regression - Balanced", lr_result_bal)
lr_result_bal %>% knitr::kable(caption = "Table Metrics Single - Balanced") %>%
kableExtra::kable_styling(full_width = F, bootstrap_options = c("hover", "condensed"),html_font = "Cambria")
| Model | accuracy | precision | recall | f_meas | mcc | kap | roc_auc |
|---|---|---|---|---|---|---|---|
| Logistic Regression - Balanced | 0.8034017 | 0.9075459 | 0.8385678 | 0.8716944 | 0.4607166 | 0.4541640 | 0.8293016 |
# lr_fit_cv_bal <- lr_wf_bal %>%
# fit_resamples(cv, metrics=metric_set(accuracy, precision ,recall, f_meas , roc_auc ,mcc, kap),
# control = control_resamples(save_pred = T))
#saveRDS(lr_fit_cv_bal, "Models/lr_fit_cv_bal.rds")
lr_fit_cv_bal <- readRDS("Models/lr_fit_cv_bal.rds")
lr_result_cv_bal <- data.frame(t(collect_metrics(lr_fit_cv_bal) %>% select(.metric, mean)))
rownames(lr_result_cv_bal) <- NULL
colnames(lr_result_cv_bal) <- lr_result_cv_bal[1,]
lr_result_cv_bal <- lr_result_cv_bal[-1, ]
lr_result_cv_bal <- data.frame(Model = "Logistic Regression - Balanced", lr_result_cv_bal)
lr_result_cv_bal %>% knitr::kable(caption = "Table Metrics Cross Validation - Balanced") %>%
kableExtra::kable_styling(full_width = F, bootstrap_options = c("hover", "condensed"),html_font = "Cambria")
| Model | accuracy | f_meas | kap | mcc | precision | recall | roc_auc |
|---|---|---|---|---|---|---|---|
| Logistic Regression - Balanced | 0.7749071 | 0.8498889 | 0.4036957 | 0.4162602 | 0.9038824 | 0.8031720 | 0.8150445 |
lr_fit_cv_bal %>%
unnest(.predictions) %>%
conf_mat(exited, .pred_class) %>%
autoplot(type = "heatmap")
# Creating model
rf_bal <- rand_forest(trees = 500) %>%
set_engine("ranger") %>%
set_mode("classification")
# Training model
#rf_fit_bal <- rf_bal %>% fit(exited ~., data= train_data_balanced)
#saveRDS(rf_fit_bal, "Models/rf_fit_bal.rds")
rf_fit_bal <- readRDS("Models/rf_fit_bal.rds")
rf_wf_bal <- workflow() %>%
add_model(rf_bal) %>%
add_recipe(rec_balanced)
# rf_res_bal <-
# last_fit(
# rf_wf_bal,
# split = data_split,
# metrics = metric_set(accuracy, precision ,recall, f_meas , roc_auc ,mcc, kap)
# )
#saveRDS(rf_res_bal,"Models/rf_res_bal.rds")
rf_res_bal <- readRDS("Models/rf_res_bal.rds")
rf_result_bal <- data.frame(t(collect_metrics(rf_res_bal) %>% select(.metric, .estimate)))
rownames(rf_result_bal) <- NULL
colnames(rf_result_bal) <- rf_result_bal[1,]
rf_result_bal <- rf_result_bal[-1, ]
rf_result_bal <- data.frame(Model = "Random Forest - Balanced", rf_result_bal)
rf_result_bal %>% knitr::kable(caption = "Table Metrics Single - Balanced") %>%
kableExtra::kable_styling(full_width = F, bootstrap_options = c("hover", "condensed"),html_font = "Cambria")
| Model | accuracy | precision | recall | f_meas | mcc | kap | roc_auc |
|---|---|---|---|---|---|---|---|
| Random Forest - Balanced | 0.8444222 | 0.9000625 | 0.9051508 | 0.9025994 | 0.5163299 | 0.5162793 | 0.8532605 |
# rf_fit_cv_bal <- rf_wf_bal %>%
# fit_resamples(cv, metrics=metric_set(accuracy, precision ,recall, f_meas , roc_auc ,mcc, kap),
# control = control_resamples(save_pred = T))
#saveRDS(rf_fit_cv_bal, "Models/rf_fit_cv_bal.rds")
rf_fit_cv_bal <- readRDS("Models/rf_fit_cv_bal.rds")
rf_result_cv_bal <- data.frame(t(collect_metrics(rf_fit_cv_bal) %>% select(.metric, mean)))
rownames(rf_result_cv_bal) <- NULL
colnames(rf_result_cv_bal) <- rf_result_cv_bal[1,]
rf_result_cv_bal <- rf_result_cv_bal[-1, ]
rf_result_cv_bal <- data.frame(Model = "Random Forest - Balanced", rf_result_cv_bal)
rf_result_cv_bal %>% knitr::kable(caption = "Table Metrics Cross Validation - Balanced") %>%
kableExtra::kable_styling(full_width = F, bootstrap_options = c("hover", "condensed"),html_font = "Cambria")
| Model | accuracy | f_meas | kap | mcc | precision | recall | roc_auc |
|---|---|---|---|---|---|---|---|
| Random Forest - Balanced | 0.8385223 | 0.8993591 | 0.4908712 | 0.4916184 | 0.8926209 | 0.9062952 | 0.8394568 |
rf_fit_cv_bal %>%
unnest(.predictions) %>%
conf_mat(exited, .pred_class) %>%
autoplot(type = "heatmap", label.color = "blue")
# Creating model
xg <- boost_tree(trees = 500) %>%
set_engine("xgboost") %>%
set_mode("classification")
# Training model
#xg_fit <- xg %>% fit(exited ~., data= train_data_imbalanced)
#saveRDS(xg_fit, "Models/xg_fit.rds")
xg_fit <- readRDS("Models/xg_fit.rds")
xg_wf <- workflow() %>%
add_model(xg) %>%
add_recipe(rec_imbalanced)
# xg_res <-
# last_fit(
# xg_wf,
# split = data_split,
# metrics = metric_set(accuracy, precision ,recall, f_meas , roc_auc ,mcc, kap)
# )
#saveRDS(xg_res,"Models/xg_res.rds")
xg_res <- readRDS("Models/xg_res.rds")
xg_result <- data.frame(t(collect_metrics(xg_res) %>% select(.metric, .estimate)))
rownames(xg_result) <- NULL
colnames(xg_result) <- xg_result[1,]
xg_result <- xg_result[-1, ]
xg_result <- data.frame(Model = "Xgboost Model - Imbalanced", xg_result)
xg_result %>% knitr::kable(caption = "Table Metrics Single - Imbalanced") %>%
kableExtra::kable_styling(full_width = F, bootstrap_options = c("hover", "condensed"),html_font = "Cambria")
| Model | accuracy | precision | recall | f_meas | mcc | kap | roc_auc |
|---|---|---|---|---|---|---|---|
| Xgboost Model - Imbalanced | 0.8504252 | 0.8827709 | 0.9365578 | 0.9088692 | 0.5006732 | 0.4938782 | 0.8475516 |
# xg_fit_cv <- xg_wf %>%
# fit_resamples(cv, metrics=metric_set(accuracy, precision ,recall, f_meas , roc_auc ,mcc, kap),
# control = control_resamples(save_pred = T))
#saveRDS(xg_fit_cv, "Models/xg_fit_cv.rds")
xg_fit_cv <- readRDS("Models/xg_fit_cv.rds")
xg_result_cv <- data.frame(t(collect_metrics(xg_fit_cv) %>% select(.metric, mean)))
rownames(xg_result_cv) <- NULL
colnames(xg_result_cv) <- xg_result_cv[1,]
xg_result_cv <- xg_result_cv[-1, ]
xg_result_cv <- data.frame(Model = "Xgboost Model - Imbalanced", xg_result_cv)
xg_result_cv %>% knitr::kable(caption = "Table Metrics Cross Validation - Imbalanced") %>%
kableExtra::kable_styling(full_width = F, bootstrap_options = c("hover", "condensed"),html_font = "Cambria")
| Model | accuracy | f_meas | kap | mcc | precision | recall | roc_auc |
|---|---|---|---|---|---|---|---|
| Xgboost Model - Imbalanced | 0.8432714 | 0.9053775 | 0.4529159 | 0.4653514 | 0.8716130 | 0.9419253 | 0.8220846 |
xg_fit_cv %>%
unnest(.predictions) %>%
conf_mat(exited, .pred_class) %>%
autoplot(type = "heatmap")
# Creating model
xg_bal <- boost_tree(trees = 500) %>%
set_engine("xgboost") %>%
set_mode("classification")
# Training model
#xg_fit_bal <- xg_bal %>% fit(exited ~., data= train_data_balanced)
#saveRDS(xg_fit_bal, "Models/xg_fit_bal.rds")
xg_fit_bal <- readRDS("Models/xg_fit_bal.rds")
xg_wf_bal <- workflow() %>%
add_model(xg_bal) %>%
add_recipe(rec_balanced)
# xg_res_bal <-
# last_fit(
# xg_wf_bal,
# split = data_split,
# metrics = metric_set(accuracy, precision ,recall, f_meas , roc_auc ,mcc, kap)
# )
#saveRDS(xg_res_bal,"Models/xg_res_bal.rds")
xg_res_bal <- readRDS("Models/xg_res_bal.rds")
xg_result_bal <- data.frame(t(collect_metrics(xg_res_bal) %>% select(.metric, .estimate)))
rownames(xg_result_bal) <- NULL
colnames(xg_result_bal) <- xg_result_bal[1,]
xg_result_bal <- xg_result_bal[-1, ]
xg_result_bal <- data.frame(Model = "Xgboost Model - Balanced", xg_result_bal)
xg_result_bal %>% knitr::kable(caption = "Table Metrics Single - Balanced") %>%
kableExtra::kable_styling(full_width = F, bootstrap_options = c("hover", "condensed"),html_font = "Cambria")
| Model | accuracy | precision | recall | f_meas | mcc | kap | roc_auc |
|---|---|---|---|---|---|---|---|
| Xgboost Model - Balanced | 0.8524262 | 0.8908981 | 0.9283920 | 0.9092587 | 0.5183932 | 0.5152418 | 0.8538809 |
# xg_fit_cv_bal <- xg_wf_bal %>%
# fit_resamples(cv, metrics=metric_set(accuracy, precision ,recall, f_meas , roc_auc ,mcc, kap),
# control = control_resamples(save_pred = T))
#saveRDS(xg_fit_cv_bal, "Models/xg_fit_cv_bal.rds")
xg_fit_cv_bal <- readRDS("Models/xg_fit_cv_bal.rds")
xg_result_cv_bal <- data.frame(t(collect_metrics(xg_fit_cv_bal) %>% select(.metric, mean)))
rownames(xg_result_cv_bal) <- NULL
colnames(xg_result_cv_bal) <- xg_result_cv_bal[1,]
xg_result_cv_bal <- xg_result_cv_bal[-1, ]
xg_result_cv_bal <- data.frame(Model = "Xgboost Model - Balanced", xg_result_cv_bal)
xg_result_cv_bal %>% knitr::kable(caption = "Table Metrics Cross Validation - Balanced") %>%
kableExtra::kable_styling(full_width = F, bootstrap_options = c("hover", "condensed"),html_font = "Cambria")
| Model | accuracy | f_meas | kap | mcc | precision | recall | roc_auc |
|---|---|---|---|---|---|---|---|
| Xgboost Model - Balanced | 0.8237736 | 0.8893934 | 0.4506173 | 0.4555440 | 0.8870763 | 0.8932634 | 0.8216237 |
xg_fit_cv_bal %>%
unnest(.predictions) %>%
conf_mat(exited, .pred_class) %>%
autoplot(type = "heatmap")
bind_rows(lr_result_cv, lr_result_cv_bal, rf_result_cv, rf_result_cv_bal, xg_result_cv, xg_result_cv_bal) %>%
knitr::kable(caption = "Performance Comparison Table for Models - Imbalanced and Balanced") %>%
kableExtra::kable_styling(full_width = F, bootstrap_options = c("hover", "condensed"),html_font = "Cambria")
| Model | accuracy | f_meas | kap | mcc | precision | recall | roc_auc |
|---|---|---|---|---|---|---|---|
| Logistic Regression - Imbalanced | 0.8315229 | 0.9023105 | 0.3140080 | 0.3738247 | 0.8384426 | 0.9769265 | 0.8158259 |
| Logistic Regression - Balanced | 0.7749071 | 0.8498889 | 0.4036957 | 0.4162602 | 0.9038824 | 0.8031720 | 0.8150445 |
| Random Forest - Imbalanced | 0.8557690 | 0.9141315 | 0.4709430 | 0.4980182 | 0.8690588 | 0.9642139 | 0.8434486 |
| Random Forest - Balanced | 0.8385223 | 0.8993591 | 0.4908712 | 0.4916184 | 0.8926209 | 0.9062952 | 0.8394568 |
| Xgboost Model - Imbalanced | 0.8432714 | 0.9053775 | 0.4529159 | 0.4653514 | 0.8716130 | 0.9419253 | 0.8220846 |
| Xgboost Model - Balanced | 0.8237736 | 0.8893934 | 0.4506173 | 0.4555440 | 0.8870763 | 0.8932634 | 0.8216237 |
Looking only at accuracy, the models trained on imbalanced data seem to perform better, but this is misleading. The confusion matrices show that these models are mostly right when the class is "no" simply because that class makes up roughly 80% of the data; since accuracy = correct predictions / total cases, the imbalance inflates the metric and biases the model toward the majority class. To pick the best model I will instead use Cohen's Kappa, a statistic that measures agreement for qualitative (categorical) predictions.
Kappa is a more useful measure on class-imbalanced problems: with a 70-30 split between classes 0 and 1, predicting class 0 for every instance already achieves 70% accuracy.
We can therefore conclude that the balanced data set is the better choice. Although this outcome is expected, I illustrated it to make clear why step_smote (from themis) is used inside the recipe to balance the data.
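A minimal base-R sketch of that 70-30 example (hypothetical data, not from the project): the majority-class predictor reaches 70% accuracy, yet its Kappa is zero because all of its agreement with the truth is expected by chance.

```r
# Toy 70-30 problem: 70 "no" and 30 "yes" cases, and a model that
# always predicts the majority class "no".
truth <- factor(c(rep("no", 70), rep("yes", 30)), levels = c("no", "yes"))
pred  <- factor(rep("no", 100), levels = c("no", "yes"))

accuracy <- mean(pred == truth)                                   # observed agreement: 0.7
# Chance agreement p_e: probability that truth and prediction match at random.
p_e   <- sum(prop.table(table(truth)) * prop.table(table(pred)))  # 0.7
kappa <- (accuracy - p_e) / (1 - p_e)                             # 0
```

In the workflows above the same quantity is computed by yardstick's `kap` metric; the hand-rolled version is only to show why accuracy and Kappa diverge on imbalanced data.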
I will use the Random Forest model for this problem, as it had the best performance.
# Specifying model to be tuned.
rf_spec <- rand_forest(
mtry = tune(),
trees = 3000,
min_n = tune()
) %>%
set_mode("classification") %>%
set_engine("ranger")
rf_wf_spec <- workflow() %>%
add_recipe(rec_balanced) %>%
add_model(rf_spec)
# set.seed(345)
#
# doParallel::registerDoParallel()
#
# tune_result <- tune_grid(
#
# rf_wf_spec,
# resamples = cv,
# grid = 10, metrics=metric_set(accuracy, precision ,recall, f_meas , roc_auc ,mcc, kap),
# control = control_grid(save_pred = TRUE)
# )
#saveRDS(tune_result, "hyperparameters/tune_result.rds")
tune_result <- readRDS("hyperparameters/tune_result.rds")
tune_result %>%
collect_metrics() %>%
filter(.metric == "kap") %>%
select(mean, min_n, mtry) %>%
pivot_longer(min_n:mtry,
values_to = "value",
names_to = "parameter"
) %>%
ggplot(aes(value, mean, color = parameter)) +
geom_point(show.legend = FALSE) +
facet_wrap(~parameter, scales = "free_x") +
labs(x = NULL, y = "Kappa")
best_kappa <- select_best(tune_result, "kap")
best_kappa
# A tibble: 1 x 3
mtry min_n .config
<int> <int> <chr>
1 6 30 Preprocessor1_Model01
final_rf <- finalize_model(
rf_spec,
best_kappa
)
final_rf
Random Forest Model Specification (classification)
Main Arguments:
mtry = 6
trees = 3000
min_n = 30
Computational engine: ranger
final_wf <- workflow() %>%
add_recipe(rec_balanced) %>%
add_model(final_rf)
# final_res <- final_wf %>%
# last_fit(data_split, metrics=metric_set(accuracy, precision ,recall, f_meas , roc_auc ,mcc, kap))
#saveRDS(final_res, "Models/final_res.rds")
final_res <- readRDS("Models/final_res.rds")
final_res <- data.frame(t(collect_metrics(final_res) %>% select(.metric, .estimate)))
rownames(final_res) <- NULL
colnames(final_res) <- final_res[1,]
final_res <- final_res[-1, ]
final_res <- data.frame(Model = "Random Forest - Final Model", final_res)
final_res %>% knitr::kable(caption = "Table Metrics Random Forest - Final Model") %>%
kableExtra::kable_styling(full_width = F, bootstrap_options = c("hover", "condensed"),html_font = "Cambria")
| Model | accuracy | precision | recall | f_meas | mcc | kap | roc_auc |
|---|---|---|---|---|---|---|---|
| Random Forest - Final Model | 0.8419210 | 0.9048223 | 0.8957286 | 0.9002525 | 0.5197302 | 0.5195764 | 0.8568457 |